Working directory is set and necessary libraries are loaded.

setwd("~/BRN/")

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(ggpubr)

Guided part

Read in the gapminder_clean.csv data as a tibble using read_csv.

gapminder <- read_csv("gapminder_clean.csv")
## New names:
## * `` -> ...1
## Rows: 2607 Columns: 20
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (2): Country Name, continent
## dbl (18): ...1, Year, Agriculture, value added (% of GDP), CO2 emissions (me...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filter the data to include only rows where Year is 1962 and then make a scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap for the filtered data.

On the filtered data, calculate the pearson correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the Pearson R value and associated p value?

gapminder %>% 
  filter(Year == 1962) %>%
  ggplot(aes(x = gdpPercap ,y = `CO2 emissions (metric tons per capita)`)) + geom_point() + stat_cor(method = "pearson")

We can see that, as expected, there was a very high, positive correlation between CO2 emissions and GDP per capita in 1962.

On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step…

gapminder %>%
  group_by(Year) %>%
  summarize(Correlation = cor(gdpPercap,`CO2 emissions (metric tons per capita)`,use = "complete.obs")) %>%
  slice_max(abs(Correlation))
## # A tibble: 1 x 2
##    Year Correlation
##   <dbl>       <dbl>
## 1  1967       0.939

The year with the strongest correlation between GDP per capita is 1967 (the second measured year, the first being 1962). We can see a consistent decrease after that date in the next graph, which may be partly due to increasing outsourcing of the secondary sector from first world countries to emerging countries. Another reason may be that economies with low GDP per capita have more margin for emission increase through industrialization, while services play a more central role in already developed economies.

gapminder %>%
  group_by(Year) %>%
  summarize(Correlation = cor(gdpPercap,`CO2 emissions (metric tons per capita)`,use = "complete.obs")) %>%
  ggplot(aes(x = Year,y = Correlation)) + geom_point()

Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p <- gapminder %>% 
  filter(Year == 1962) %>%
  ggplot(aes(x = gdpPercap ,y = `CO2 emissions (metric tons per capita)`)) + geom_point(aes(size = pop,colour = continent)) + stat_cor(method = "pearson") 
ggplotly(p)

We can see that African, American and Asian countries are at the bottom in both emissions and GDP per capita. European countries, with some exception, are more rich and carbonic anhydride emitting.The two American outliers must be Canada and the United States of America.

data_frame <- as.data.frame(gapminder)

Unguided part

What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’? (stats test needed)

Before applying a statistical test we will visualize the distribution of energy use in each continent through the use of boxplots. We have used only the values for 2007 for the amount of outliers to be readily interpretable. You can hover your mose over a point in the interactive graph to see which country it is representing.

p <- gapminder %>%
  filter(!is.na(continent) & Year == 2007) %>%
  ggplot(aes(x = continent,y = `Energy use (kg of oil equivalent per capita)`)) + geom_boxplot() + geom_point(aes(color = `Country Name`)) + theme(legend.position = "none")
  
ggplotly(p)
## Warning: Removed 17 rows containing non-finite values (stat_boxplot).

Europe and Oceania are the most energy consuming countries, followed by Asia and the Americas. Asian outliers are rich countries, due to oil or otherwise. The three clear American outliers are the two rich North American countries and the industry intensive Trinidad & Tobago. Africa presents the lower values, with South Africa and resource rich Lybia and Equatorial Guinea as outliers.

If the data are normally distributed and homocedastic, an ANOVA test would be appropriate. Let’s check those assumptions.

gapminder %>%
  filter(!is.na(continent) & !is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
  group_by(continent) %>%
  summarize(Shapiro =shapiro.test(`Energy use (kg of oil equivalent per capita)`)[2]$p.value)
## # A tibble: 5 x 2
##   continent  Shapiro
##   <chr>        <dbl>
## 1 Africa    2.33e-19
## 2 Americas  1.59e-21
## 3 Asia      4.94e-19
## 4 Europe    3.00e-12
## 5 Oceania   9.55e- 1

Only Oceania seems to be normally distributed. The null hypothesis (normality) is rejected. We shall thus use the Kruskal-Wallis test.

gapminder %>%
  filter(!is.na(continent) & !is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
  summarise(kruskal = kruskal.test(`Energy use (kg of oil equivalent per capita)`~continent)$p.value)
## # A tibble: 1 x 1
##    kruskal
##      <dbl>
## 1 1.01e-67

The null hypothesis (equality of means) is clearly rejected.

This test only refutes the hypothesis that all means are equal but doesn’t tell us if the difference between two specific continents is significative. In order to see that we will use a pairwise Wilcoxon Test.

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
gapminder %>%
  filter(!is.na(continent) & !is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
  pairwise_wilcox_test(`Energy use (kg of oil equivalent per capita)`~ continent) %>%
  select(group1,group2,p.adj.signif)
## # A tibble: 10 x 3
##    group1   group2   p.adj.signif
##    <chr>    <chr>    <chr>       
##  1 Africa   Americas ****        
##  2 Africa   Asia     ***         
##  3 Africa   Europe   ****        
##  4 Africa   Oceania  ****        
##  5 Americas Asia     ns          
##  6 Americas Europe   ****        
##  7 Americas Oceania  ****        
##  8 Asia     Europe   ****        
##  9 Asia     Oceania  ****        
## 10 Europe   Oceania  **

As we can see, all pairwise comparisons are significant except the difference between Asia and the Americas.

Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990? (stats test needed)

As in the previous section, we will visualize before applying a statistical test:

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
 p <- gapminder %>%
  filter(!is.na(`Imports of goods and services (% of GDP)`)& Year > 1990) %>%
  filter(continent == "Europe" | continent == "Asia") %>%
  group_by(Year) %>%
  summarise(mean_asia = mean(`Imports of goods and services (% of GDP)`[continent == "Asia"]),mean_europe = mean(`Imports of goods and services (% of GDP)`[continent == "Europe"])) %>%
  gather(key,value,mean_asia,mean_europe) %>%
  ggplot(aes(x = Year,y = value,colour = key)) + geom_point() + geom_line() + labs(y = "Imports of goods and services (% of GDP)")
 
ggplotly(p)

In the graph we can see that mean European country imports as % of GDP have been growing faster than mean Asian country imports as % of GDP, almost converging in 2007.

Since n > 30 we don’t need to check for normality, but we still have to check whether there is homogeneity of variances. We will use a Levene test for that.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
gapminder %>%
  filter(!is.na(`Imports of goods and services (% of GDP)`)& Year > 1990) %>%
  filter(continent == "Europe" | continent == "Asia") %>%
  summarise( leveneTest(y = `Imports of goods and services (% of GDP)`,group = continent))
## # A tibble: 2 x 3
##      Df `F value`  `Pr(>F)`
##   <int>     <dbl>     <dbl>
## 1     1      14.5  0.000185
## 2   210      NA   NA

The null hypothesis (homogeneity of variance) is rejected. We shall then apply a t test for different variances.

library(car)
gapminder %>%
  filter(!is.na(`Imports of goods and services (% of GDP)`) & Year > 1990) %>%
  filter(continent == "Europe" | continent == "Asia") %>%
  summarise(t.test(formula = `Imports of goods and services (% of GDP)` ~ continent,var.equal = F)$p.value)
## # A tibble: 1 x 1
##   `...$NULL`
##        <dbl>
## 1      0.178

We can’t reject the null hypothesis of equality of means: there isn’t a significant difference in post-1990 imports of goods and services (% of GDP) between Asia and Europe.

What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

gapminder %>%
  drop_na(`Population density (people per sq. km of land area)`) %>%
  group_by(Year) %>%
  mutate(ranking = min_rank(desc(`Population density (people per sq. km of land area)`))) %>%
  ungroup() %>%
  group_by(`Country Name`) %>%
  summarize(mean_ranking = mean(ranking)) %>%
  arrange(mean_ranking)
## # A tibble: 262 x 2
##    `Country Name`            mean_ranking
##    <chr>                            <dbl>
##  1 Macao SAR, China                   1.5
##  2 Monaco                             1.5
##  3 Hong Kong SAR, China               3.1
##  4 Singapore                          3.9
##  5 Gibraltar                          5  
##  6 Bermuda                            6.2
##  7 Malta                              7  
##  8 Bangladesh                         9.2
##  9 Channel Islands                    9.4
## 10 Sint Maarten (Dutch part)         10.5
## # ... with 252 more rows

The Macao Special Administrative Region within China and the city state of Monaco top the ranking of population density rankings, followed by other small countries. The only considerably-sized country in the list is Bangladesh.

What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ since 1962?

gapminder %>%
  filter(`Country Name` %in% as.matrix(gapminder[!is.na(gapminder$`Life expectancy at birth, total (years)`) & gapminder$Year == 1962,"Country Name"])) %>% 
  filter(Year == 1962 | Year == 2007) %>%
  group_by(`Country Name`) %>%
  arrange(Year) %>%
  mutate(Difference = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962]) %>%
  select(Difference) %>%
  slice_max(Difference)
## Adding missing grouping variables: `Country Name`
## # A tibble: 472 x 2
## # Groups:   Country Name [236]
##    `Country Name`      Difference
##    <chr>                    <dbl>
##  1 Afghanistan               24.6
##  2 Afghanistan               24.6
##  3 Albania                   12.3
##  4 Albania                   12.3
##  5 Algeria                   25.9
##  6 Algeria                   25.9
##  7 Angola                    15.6
##  8 Angola                    15.6
##  9 Antigua and Barbuda       12.2
## 10 Antigua and Barbuda       12.2
## # ... with 462 more rows

Afghanistan has shown the greatest increase.